Credit Card Fraud Detection: Leveraging Random Forest and Stacking

In [1]:
import random
random.seed(9001)

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import warnings
from sklearn.model_selection import cross_val_score, RepeatedKFold
  1. Load the dataset
In [2]:
#from google.colab import drive
#drive.mount('/content/drive')

path = "/content/creditcard.csv"
creditcard = pd.read_csv(path)
  2. Show the first 6 data points
In [3]:
creditcard.head(6)
Out[3]:
Time V1 V2 V3 V4 V5 V6 V7 V8 V9 ... V21 V22 V23 V24 V25 V26 V27 V28 Amount Class
0 0.0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599 0.098698 0.363787 ... -0.018307 0.277838 -0.110474 0.066928 0.128539 -0.189115 0.133558 -0.021053 149.62 0
1 0.0 1.191857 0.266151 0.166480 0.448154 0.060018 -0.082361 -0.078803 0.085102 -0.255425 ... -0.225775 -0.638672 0.101288 -0.339846 0.167170 0.125895 -0.008983 0.014724 2.69 0
2 1.0 -1.358354 -1.340163 1.773209 0.379780 -0.503198 1.800499 0.791461 0.247676 -1.514654 ... 0.247998 0.771679 0.909412 -0.689281 -0.327642 -0.139097 -0.055353 -0.059752 378.66 0
3 1.0 -0.966272 -0.185226 1.792993 -0.863291 -0.010309 1.247203 0.237609 0.377436 -1.387024 ... -0.108300 0.005274 -0.190321 -1.175575 0.647376 -0.221929 0.062723 0.061458 123.50 0
4 2.0 -1.158233 0.877737 1.548718 0.403034 -0.407193 0.095921 0.592941 -0.270533 0.817739 ... -0.009431 0.798278 -0.137458 0.141267 -0.206010 0.502292 0.219422 0.215153 69.99 0
5 2.0 -0.425966 0.960523 1.141109 -0.168252 0.420987 -0.029728 0.476201 0.260314 -0.568671 ... -0.208254 -0.559825 -0.026398 -0.371427 -0.232794 0.105915 0.253844 0.081080 3.67 0

6 rows × 31 columns

  3. Describe the pandas DataFrame
In [ ]:
creditcard.describe()
Out[ ]:
Time V1 V2 V3 V4 V5 V6 V7 V8 V9 ... V21 V22 V23 V24 V25 V26 V27 V28 Amount Class
count 284807.000000 2.848070e+05 2.848070e+05 2.848070e+05 2.848070e+05 2.848070e+05 2.848070e+05 2.848070e+05 2.848070e+05 2.848070e+05 ... 2.848070e+05 2.848070e+05 2.848070e+05 2.848070e+05 2.848070e+05 2.848070e+05 2.848070e+05 2.848070e+05 284807.000000 284807.000000
mean 94813.859575 1.168375e-15 3.416908e-16 -1.379537e-15 2.074095e-15 9.604066e-16 1.487313e-15 -5.556467e-16 1.213481e-16 -2.406331e-15 ... 1.654067e-16 -3.568593e-16 2.578648e-16 4.473266e-15 5.340915e-16 1.683437e-15 -3.660091e-16 -1.227390e-16 88.349619 0.001727
std 47488.145955 1.958696e+00 1.651309e+00 1.516255e+00 1.415869e+00 1.380247e+00 1.332271e+00 1.237094e+00 1.194353e+00 1.098632e+00 ... 7.345240e-01 7.257016e-01 6.244603e-01 6.056471e-01 5.212781e-01 4.822270e-01 4.036325e-01 3.300833e-01 250.120109 0.041527
min 0.000000 -5.640751e+01 -7.271573e+01 -4.832559e+01 -5.683171e+00 -1.137433e+02 -2.616051e+01 -4.355724e+01 -7.321672e+01 -1.343407e+01 ... -3.483038e+01 -1.093314e+01 -4.480774e+01 -2.836627e+00 -1.029540e+01 -2.604551e+00 -2.256568e+01 -1.543008e+01 0.000000 0.000000
25% 54201.500000 -9.203734e-01 -5.985499e-01 -8.903648e-01 -8.486401e-01 -6.915971e-01 -7.682956e-01 -5.540759e-01 -2.086297e-01 -6.430976e-01 ... -2.283949e-01 -5.423504e-01 -1.618463e-01 -3.545861e-01 -3.171451e-01 -3.269839e-01 -7.083953e-02 -5.295979e-02 5.600000 0.000000
50% 84692.000000 1.810880e-02 6.548556e-02 1.798463e-01 -1.984653e-02 -5.433583e-02 -2.741871e-01 4.010308e-02 2.235804e-02 -5.142873e-02 ... -2.945017e-02 6.781943e-03 -1.119293e-02 4.097606e-02 1.659350e-02 -5.213911e-02 1.342146e-03 1.124383e-02 22.000000 0.000000
75% 139320.500000 1.315642e+00 8.037239e-01 1.027196e+00 7.433413e-01 6.119264e-01 3.985649e-01 5.704361e-01 3.273459e-01 5.971390e-01 ... 1.863772e-01 5.285536e-01 1.476421e-01 4.395266e-01 3.507156e-01 2.409522e-01 9.104512e-02 7.827995e-02 77.165000 0.000000
max 172792.000000 2.454930e+00 2.205773e+01 9.382558e+00 1.687534e+01 3.480167e+01 7.330163e+01 1.205895e+02 2.000721e+01 1.559499e+01 ... 2.720284e+01 1.050309e+01 2.252841e+01 4.584549e+00 7.519589e+00 3.517346e+00 3.161220e+01 3.384781e+01 25691.160000 1.000000

8 rows × 31 columns

In [ ]:
len(creditcard)
Out[ ]:
284807
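The `describe()` output above reports a mean of 0.001727 for the binary Class column; combined with the row count, this implies roughly how many fraud cases the dataset contains. A quick back-of-the-envelope check (computed from the summary statistics, not re-read from the CSV):

```python
# Approximate fraud count implied by the summary statistics above:
# mean(Class) * n_rows, since Class is a 0/1 indicator.
n_rows = 284_807
class_mean = 0.001727
fraud_count = round(class_mean * n_rows)
print(fraud_count)          # ~492 fraudulent transactions
print(f"{class_mean:.4%}")  # ~0.17% of all transactions
```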
  4. Show a correlation heatmap of the entire dataset using matplotlib and sns; choose any color palette (except blue) you like
In [ ]:
corr = creditcard.corr()

plt.figure(figsize=(15, 10))
sns.heatmap(corr, cmap="YlOrRd", annot=True)
plt.title("Correlation Heatmap of Credit Card Dataset")
plt.show()
  5. Show the scatterplot matrix for the DataFrame
In [ ]:
import random

random_columns = random.sample(list(creditcard.columns), 10)
sns.pairplot(creditcard[random_columns])
#sns.pairplot(creditcard)

plt.suptitle("Scatterplot Matrix for 10 Randomly Selected Columns", y=1.02)
plt.show()
In [ ]:
import plotly.express as px

creditcard['Class'] = creditcard['Class'].astype('category')

color_scale = {0: 'blue', 1: 'red'}

fig = px.scatter_matrix(creditcard,
    dimensions=["V2", "V4", "V8", "V11", "V19", "V20", 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27'],
    color="Class",
    color_discrete_map=color_scale,
    opacity=0.7,
    title="Scatterplot Matrix of Selected Features for credit card fraud detection",
    height=800,
    width=1000
)

fig.update_traces(marker=dict(size=4))


fig.show()

fig.write_html("Scatterplot_Matrix.html")
In [47]:
# In case the interactive figure above does not render in the exported HTML file (tested once and it did not display)
from IPython.display import Image

Image('scatterplot_matrix_selected_features.png')
Out[47]:
  6. Split the dataset into training and test sets. Justify the rationale
In [7]:
from sklearn.model_selection import train_test_split

X = creditcard.drop(['Class','Amount','Time'], axis=1)
y = creditcard['Class']  # Target variable

## Doing an 80-20 split with stratified sampling, given the imbalanced dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify = y )

print(f"Training set size: {len(X_train)}")
print(f"Testing set size: {len(X_test)}")
Training set size: 227845
Testing set size: 56962

I chose an 80-20 split with stratified sampling because the credit card dataset is highly imbalanced: the mean of the binary "Class" variable is 0.001727, i.e. only about 0.17% of transactions are fraudulent.
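A minimal sketch of why `stratify=y` matters here, using a synthetic label vector with a similar imbalance (the names and counts below are illustrative, not from the notebook):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic 0/1 labels: 10 positives out of 5000 (~0.2%, similar to this dataset)
y_demo = np.zeros(5000, dtype=int)
y_demo[:10] = 1
X_demo = np.arange(5000).reshape(-1, 1)

# With stratify, the 80/20 split preserves the class ratio in both partitions;
# without it, a random split could easily leave very few positives in the test set.
_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
print(y_tr.sum(), y_te.sum())  # 8 positives in train, 2 in test
```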

  7. Perform the classification routine and output the accuracy box plot (make sure to change the regressmod df to classmod, and use an appropriate metric for classification evaluation, for example accuracy, precision, recall, etc.)
In [9]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier, GradientBoostingClassifier
import xgboost as xgb
from sklearn.model_selection import cross_val_score, RepeatedKFold, cross_validate
In [5]:
#perform classification routine

def base_models():
    models = dict()
    models['Logistic Regression'] = LogisticRegression()
    models['Decision Tree'] = DecisionTreeClassifier()
    models['SVC'] = SVC()
    models['GaussianNB'] = GaussianNB()
    models['Random Forest'] = RandomForestClassifier()
    models['Bagging'] = BaggingClassifier()
    models['Gradient Boosting'] = GradientBoostingClassifier()
    models['XGBoost'] = xgb.XGBClassifier(use_label_encoder=False, eval_metric='logloss')
    return models

The evaluation involves 5-fold cross-validation with 3 repeats, using precision, recall, and F1 score as the metrics. Because the dataset is highly imbalanced, accuracy is a poor choice for evaluation. Precision and recall focus on the positive class (fraud), and since there is a trade-off between the two, the F1 score is helpful as their harmonic mean.
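As a quick illustration of the harmonic-mean point, F1 can be computed directly from precision and recall (the numbers below are illustrative, not the notebook's results):

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# The harmonic mean is dragged toward the weaker of the two metrics,
# so a model cannot hide poor recall behind high precision:
print(round(f1_from_pr(0.95, 0.95), 3))  # 0.95
print(round(f1_from_pr(0.95, 0.40), 3))  # 0.563, well below the arithmetic mean of 0.675
```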

In [ ]:
def eval_models(model):
    # 5-fold CV repeated 3 times; with this imbalance, RepeatedStratifiedKFold
    # would also be a reasonable choice to keep the fraud ratio in every fold
    cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=1)
    scoring_metrics = ['precision', 'recall', 'f1']
    scores = cross_validate(model, X_train, y_train, scoring=scoring_metrics, cv=cv, n_jobs=-1, error_score='raise')
    return scores

models = base_models()

# For clarity in output
metric_names = ['test_precision', 'test_recall', 'test_f1']

for name, model in models.items():
    scores = eval_models(model)
    print(f'Metrics for {name}:')
    for metric in metric_names:
        mean_score = scores[metric].mean()
        std_score = scores[metric].std()
        print(f'{metric}: {mean_score:.3f} ({std_score:.3f})')
    print('-' * 50)
Metrics for Logistic Regression:
test_precision: 0.884 (0.030)
test_recall: 0.637 (0.067)
test_f1: 0.738 (0.044)
--------------------------------------------------
Metrics for Decision Tree:
test_precision: 0.746 (0.049)
test_recall: 0.754 (0.062)
test_f1: 0.747 (0.033)
--------------------------------------------------
Metrics for SVC:
test_precision: 0.943 (0.031)
test_recall: 0.673 (0.061)
test_f1: 0.783 (0.038)
--------------------------------------------------
Metrics for GaussianNB:
test_precision: 0.062 (0.005)
test_recall: 0.826 (0.041)
test_f1: 0.115 (0.008)
--------------------------------------------------
Metrics for Random Forest:
test_precision: 0.939 (0.029)
test_recall: 0.773 (0.052)
test_f1: 0.847 (0.030)
--------------------------------------------------
Metrics for Bagging:
test_precision: 0.929 (0.029)
test_recall: 0.761 (0.052)
test_f1: 0.835 (0.029)
--------------------------------------------------
Metrics for Gradient Boosting:
test_precision: 0.787 (0.095)
test_recall: 0.485 (0.231)
test_f1: 0.574 (0.219)
--------------------------------------------------
Metrics for XGBoost:
test_precision: 0.944 (0.026)
test_recall: 0.786 (0.050)
test_f1: 0.856 (0.027)
--------------------------------------------------

The results show that XGBoost has the highest F1 score, along with high precision. Random Forest would be the next best option.

Box plot of model performance, using an appropriate metric for classification evaluation (the F1 score).

In [46]:
def eval_models(model):
    cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=1)
    scores = cross_val_score(model, X_train, y_train, scoring='f1', cv=cv, n_jobs=-1, error_score='raise')
    return scores

models = base_models()

all_scores = []

for name, model in models.items():
    scores = eval_models(model)
    all_scores.append(scores)

model_names = list(models.keys())

df_scores = pd.DataFrame(np.transpose(all_scores), columns=model_names)

# Melt the DataFrame for visualization
classifmod = pd.melt(df_scores.reset_index(), id_vars='index', value_vars=model_names)

# Create the box plot
fig = px.box(classifmod, x="variable", y="value", color="variable", points='all',
             labels={"variable": "Machine Learning Model",
                     "value": "F1 Score"
                     }, title="Model Performance")
fig.show()
---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-46-c1a62cda79a9> in <cell line: 10>()
     10 for name, model in models.items():
---> 11     scores = eval_models(model)

<ipython-input-46-c1a62cda79a9> in eval_models(model)
----> 3     scores = cross_val_score(model, X_train, y_train, scoring='f1', cv=cv, n_jobs=-1, error_score='raise')

[... sklearn/joblib worker frames elided ...]

KeyboardInterrupt: 
In [44]:
# I ran this cell successfully once and captured the box plot as a screenshot. Because it takes a while to run, the result from that earlier run is attached below.

from IPython.display import Image

Image('model_performance_boxplot.png')
Out[44]:
  • I decided to use the F1 score because it accounts for both false positives and false negatives. A good F1 score indicates a balance between capturing most of the actual fraudulent transactions (high recall) and not flagging too many legitimate transactions as fraudulent (high precision).

  • As the results show, GaussianNB and Gradient Boosting are the worst performers among the models, while XGBoost and Random Forest are the best. XGBoost has the highest median F1 score, most of its individual scores are above 0.8, and it has a relatively tight spread compared to the other high-performing models. Therefore, I will select XGBoost as the level 0 classifier.

  8. Stacked models: select the best classifier as the level 0 classifier. Use logistic regression as the second-level classifier. Similar to 5, generate the box plot and show the accuracy of each algorithm as well as the stacked classifier. Also show the confusion matrices of the above algorithms
In [16]:
import xgboost as xgb
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, make_scorer, f1_score
import pandas as pd
import plotly.express as px


# Level 0 classifier: XGBoost
estimators = [
    ('xgb', xgb.XGBClassifier(use_label_encoder=False, eval_metric='logloss')),
]

# Level 1 (meta) classifier: logistic regression
clf = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression())

# F1 scores
f1_scores = cross_val_score(clf, X_train, y_train, cv=5, scoring=make_scorer(f1_score))

# Fit and evaluate stacked classifier
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("Metrics for Stacked Classifier:")
print(classification_report(y_test, y_pred))

print("Confusion matrix for Stacked Classifier:")
cm = confusion_matrix(y_test, y_pred)
print(cm)

classifmod = pd.DataFrame(f1_scores, columns=['Stacked Model'])

# Melt the DataFrame for visualization
classifmod_melted = pd.melt(classifmod.reset_index(), id_vars='index', value_vars=['Stacked Model'])

# Plot the scores using plotly.express
fig = px.box(classifmod_melted, x="variable", y="value", color="variable", points='all',
             labels={"variable": "Machine Learning Model",
                     "value": "F1 Score"
                     }, title="Model Performance for Stacked Model")
fig.show()
Metrics for Stacked Classifier:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     71079
           1       0.95      0.76      0.85       123

    accuracy                           1.00     71202
   macro avg       0.97      0.88      0.92     71202
weighted avg       1.00      1.00      1.00     71202

Confusion matrix for Stacked Classifier:
[[71074     5]
 [   29    94]]
In [50]:
# Tested and realized the interactive figure does not display in the exported HTML, therefore attaching the screenshot below
from IPython.display import Image

Image('stacked_model_performance.png')
Out[50]:
  • In terms of predicting fraud, the stacked model has an overall F1 score of 0.85, with 0.95 precision and 0.76 recall. The confusion matrix shows that 5 legitimate transactions were incorrectly flagged as fraud (false positives) and 29 frauds were incorrectly predicted as normal (false negatives). Given the imbalanced dataset, the stacked model performed quite well.
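The precision, recall, and F1 quoted above can be reproduced directly from the confusion matrix printed by the cell:

```python
import numpy as np

# Confusion matrix for the stacked classifier, as printed above:
# rows = actual (0, 1), columns = predicted (0, 1)
cm = np.array([[71074, 5],
               [29, 94]])

tn, fp, fn, tp = cm.ravel()
precision = tp / (tp + fp)   # of flagged transactions, how many were actually fraud
recall = tp / (tp + fn)      # of actual frauds, how many were caught
f1 = 2 * precision * recall / (precision + recall)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# precision=0.95 recall=0.76 f1=0.85
```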
  9. Export the model with pickle and import it back. Use the imported model to predict y_test from X_test and report the confusion matrices
In [23]:
import pickle

level0 = list()
level0.append(('xgb', xgb.XGBClassifier(use_label_encoder=False, eval_metric='logloss')))

level1 = LogisticRegression()
model = StackingClassifier(estimators=level0, final_estimator=level1, cv=5)
model.fit(X_train, y_train)

# Save to file in the current working directory
pkl_filename = "AssignmentPickle.pkl"
with open(pkl_filename, 'wb') as file:
    pickle.dump(model, file)

# Load from file
with open(pkl_filename, 'rb') as file:
    pickle_model = pickle.load(file)

# For classifiers, .score() returns mean accuracy on the given data
score = pickle_model.score(X_test, y_test)
print("Test score: {0:.2f} %".format(100 * score))
Y_predict = pickle_model.predict(X_test)
Test score: 99.95 %
  • The 99.95% test score shows that the stacked model achieved very high accuracy on the test set.
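One caveat worth keeping in mind: with only ~0.17% fraud, accuracy is inflated by the majority class. A classifier that predicted "not fraud" for everything would already score about 99.83%, so the confusion matrix in the next step is the more informative check. A quick arithmetic sketch:

```python
# Baseline accuracy of always predicting the majority class,
# using the class mean reported by describe() above:
fraud_rate = 0.001727
baseline_accuracy = 1 - fraud_rate
print(f"{baseline_accuracy:.4%}")  # 99.8273%, not far below the 99.95% reported above
```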
  10. Show both text and visual confusion matrices using scikit-learn and matplotlib, and explain what the graph tells you and what you did
In [27]:
import matplotlib.pyplot as plt
import seaborn as sns

print("Confusion matrix for Stacked Classifier:")
cm = confusion_matrix(y_test, Y_predict)
print(cm)

plt.figure(figsize=(8,8))
sns.heatmap(cm, annot=True, fmt="d", cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix', fontsize=10)
plt.show()
Confusion matrix for Stacked Classifier:
[[71074     5]
 [   29    94]]

Convert the file into HTML

In [40]:
import nbconvert

!pwd
/content
In [49]:
!jupyter nbconvert --to html  /content/ANLY6700_Assignment2_NW.ipynb
[NbConvertApp] Converting notebook /content/ANLY6700_Assignment2_NW.ipynb to html
[NbConvertApp] Writing 24206727 bytes to /content/ANLY6700_Assignment2_NW.html